Getting started with xdatasets
The xdatasets library gives users access to a large collection of earth observation datasets in xarray-compatible formats.
The library takes an opinionated approach to data querying and caters to the needs of specific user groups such as hydrologists, climate scientists, and engineers. It can extract data at a specific location or within a designated region, such as a watershed or municipality, and apply spatial and temporal operations along the way.
Data is requested through a query. A simple query that extracts the variables t2m (2 m temperature) and tp (total precipitation) from the era5_reanalysis_single_levels dataset at two locations (Montreal and Toronto) could look like this:
query = {
    "variables": {"era5_reanalysis_single_levels": ["t2m", "tp"]},
    "space": {
        "clip": "point",  # bbox, point or polygon
        "geometry": {
            "Montreal": (45.508888, -73.561668),
            "Toronto": (43.651070, -79.347015),
        },
    },
}
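To make the idea of a point query more concrete, the sketch below maps each (latitude, longitude) pair to the closest cell of a regular 0.25° grid. It is only an illustration: the grid extent and the `nearest_cell` helper are hypothetical, and this page does not state how xdatasets actually selects grid cells for a point.

```python
import numpy as np

# Hypothetical 0.25-degree grid (extent chosen only for this illustration).
lats = np.arange(42.0, 48.0 + 0.25, 0.25)
lons = np.arange(-81.0, -69.0 + 0.25, 0.25)

# Same (latitude, longitude) pairs as in the query above.
points = {
    "Montreal": (45.508888, -73.561668),
    "Toronto": (43.651070, -79.347015),
}


def nearest_cell(lat, lon):
    """Return the grid coordinates closest to (lat, lon)."""
    i = int(np.abs(lats - lat).argmin())
    j = int(np.abs(lons - lon).argmin())
    return float(lats[i]), float(lons[j])


nearest = {name: nearest_cell(lat, lon) for name, (lat, lon) in points.items()}
print(nearest)
```

Under these assumptions, each named location ends up associated with a single grid cell, which is why the query's geometry is keyed by name.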
A more complex query is shown below. It requests the same variables as above, but instead of point locations, a geopandas.GeoDataFrame supplies the features (loaded, for example, from shapefiles or GeoJSON) within which data is extracted. Each polygon is identified by the unique identifier Station, and a spatial average is computed within each one (aggregation: True). The dataset, initially at an hourly time step, is converted to a daily time step by applying one or more temporal aggregations to each variable, as prescribed in the query. xdatasets ultimately returns the dataset for the specified date range and time zone.
query = {
    "variables": {"era5_reanalysis_single_levels": ["t2m", "tp"]},
    "space": {
        "clip": "polygon",  # bbox, point or polygon
        "aggregation": True,  # spatial average within each polygon
        "geometry": gdf,
        "unique_id": "Station",
    },
    "time": {
        "timestep": "D",
        "aggregation": {"tp": np.nansum,
                        "t2m": [np.nanmax, np.nanmin]},
        "start": '1959-01-01',
        "end": '1961-05-31',
        "timezone": 'America/Montreal',
    },
}
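The time block of the query above can be pictured with plain pandas. The sketch below is not xdatasets' implementation: it uses synthetic hourly values and assumes that the aggregation amounts to converting timestamps to the requested time zone and then reducing each local day with the given NumPy functions. The output column names follow the `variable_function` pattern seen in the results later on this page.

```python
import numpy as np
import pandas as pd

# Synthetic hourly series in UTC; the values are made up for illustration.
time = pd.date_range("1959-01-01", periods=72, freq=pd.Timedelta(hours=1), tz="UTC")
hourly = pd.DataFrame(
    {"t2m": 270.0 + np.arange(72) % 24, "tp": np.full(72, 0.001)},
    index=time,
)

# Convert to local time first so that daily bins follow local midnight.
local = hourly.tz_convert("America/Montreal")

# One output column per (variable, aggregation) pair.
daily = pd.DataFrame(
    {
        "tp_nansum": local["tp"].resample("D").apply(np.nansum),
        "t2m_nanmax": local["t2m"].resample("D").apply(np.nanmax),
        "t2m_nanmin": local["t2m"].resample("D").apply(np.nanmin),
    }
)
print(daily)
```

Because midnight UTC falls on the previous evening in Montreal, the first daily bin belongs to the day before the requested start date and contains only a few hours of data.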
Don't worry! Additional examples below will help in comprehending the range of possible queries.
[1]:
## Query climate datasets
[2]:
import xdatasets as xd
import intake
import numpy as np
import geopandas as gpd
import pandas as pd
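from pathlib import Path
import hvplot.pandas  # noqa: F401  (registers the .hvplot accessor on DataFrames)
import hvplot.xarray  # noqa: F401  (registers the .hvplot accessor on xarray objects)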
[3]:
# Plain strings are used because pathlib.Path would collapse the "//" in the URL
bucket = 'https://s3.us-east-2.wasabisys.com/watersheds-polygons/MELCC/json'
paths = [f'{bucket}/023003/023003.json',
         f'{bucket}/031101/031101.json',
         f'{bucket}/040111/040111.json']
[4]:
gdf = pd.concat([gpd.read_file(path).to_crs(4326) for path in paths]).reset_index(drop=True)
gdf
[4]:
| | Station | Superficie | geometry |
|---|---|---|---|
| 0 | 023003 | 208.4591919813271 | POLYGON ((-70.82601 46.81658, -70.82728 46.815... |
| 1 | 031101 | 111.7131058782722 | POLYGON ((-73.98519 45.21072, -73.98795 45.209... |
| 2 | 040111 | 433.440893903503 | POLYGON ((-74.06645 46.02253, -74.06647 46.022... |
[5]:
gdf.hvplot(geo=True,
           tiles='ESRI',
           color='Station',
           alpha=0.8,
           width=750,
           height=450,
           legend='top',
           hover_cols=['Station', 'Superficie'])
[6]:
%%time
# http://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
query = {
    "variables": {"era5_reanalysis_single_levels": ["t2m", "tp"]},
    "space": {
        "clip": "polygon",  # bbox, point or polygon
        "aggregation": False,
        "geometry": gdf,
        "unique_id": "Station",
    },
    "time": {
        # "timestep": "D",
        # "aggregation": {"tp": np.nansum,
        #                 "t2m": [np.nanmax, np.nanmin]},
        "start": '1979-01-01',
        "end": '2020-12-31',
        # "timezone": 'America/Montreal',
    },
}
xds = xd.Dataset(**query)
3it [00:26, 8.90s/it]
<xarray.Dataset>
Dimensions: (latitude: 12, longitude: 13, time: 368184, Station: 3)
Coordinates:
* latitude (latitude) float32 44.5 44.75 45.0 45.25 ... 46.75 47.0 47.25
* longitude (longitude) float32 -75.0 -74.75 -74.5 ... -70.75 -70.5 -70.25
* time (time) datetime64[ns] 1979-01-01 ... 2020-12-31T23:00:00
Dimensions without coordinates: Station
Data variables:
t2m (Station, time, latitude, longitude) float32 nan nan ... nan nan
tp (Station, time, latitude, longitude) float32 nan nan ... nan nan
Attributes:
institution: ECMWF
source: Reanalysis
title: ERA5 forecasts
CPU times: user 13.7 s, sys: 5.41 s, total: 19.1 s
Wall time: 28.1 s
[7]:
ds_clipped = xds.bbox_clip(xds.data.isel(Station=0))
ds_clipped
[7]:
<xarray.Dataset>
Dimensions: (time: 368184, latitude: 6, longitude: 6)
Coordinates:
* latitude (latitude) float32 46.0 46.25 46.5 46.75 47.0 47.25
* longitude (longitude) float32 -71.5 -71.25 -71.0 -70.75 -70.5 -70.25
* time (time) datetime64[ns] 1979-01-01 ... 2020-12-31T23:00:00
Data variables:
t2m (time, latitude, longitude) float32 272.8 272.1 ... 269.3 269.1
tp (time, latitude, longitude) float32 nan nan nan ... 0.0 0.0 0.0
Attributes:
institution: ECMWF
source: Reanalysis
title: ERA5 forecasts
[8]:
ds_clipped.t2m.hvplot()
[9]:
%%time
# http://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
query = {
    "variables": {"era5_reanalysis_single_levels": ["t2m", "tp"]},
    "space": {
        "clip": "polygon",  # bbox, point or polygon
        "aggregation": True,
        "geometry": gdf,
        "unique_id": "Station",
    },
    "time": {
        "timestep": "D",
        "aggregation": {"tp": [np.nansum],
                        "t2m": [np.nanmax, np.nanmin]},
        "start": '1979-01-01',
        "end": '2020-12-31',
        "timezone": 'America/Montreal',
    },
}
xds = xd.Dataset(**query)
3it [00:30, 10.32s/it]
<xarray.Dataset>
Dimensions: (time: 368184, Station: 3)
Coordinates:
* time (time) datetime64[ns] 1978-12-31T19:00:00 ... 2020-12-31T18:0...
longitude (Station) float64 -70.94 -74.14 -74.27
latitude (Station) float64 46.72 45.18 46.08
geom (Station) int64 0 1 2
* Station (Station) object '023003' '031101' '040111'
Superficie (Station) object '208.4591919813271' ... '433.440893903503'
Data variables:
t2m (time, Station) float32 268.8 274.4 269.7 ... 269.3 271.3 266.5
tp (time, Station) float32 nan nan nan nan nan ... 0.0 0.0 0.0 0.0
Attributes:
institution: ECMWF
source: Reanalysis
title: ERA5 forecasts
timezone: America/Montreal
regrid_method: conservative
Processing tp: 100%|██████████| 2/2 [00:07<00:00, 3.59s/it]
CPU times: user 22.5 s, sys: 1.26 s, total: 23.7 s
Wall time: 39 s
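With spatial aggregation enabled, each station ends up with a single value per time step, and the output above reports regrid_method: conservative. The sketch below shows the general idea of an area-weighted average on a small grid. It is an illustration only: the temperatures, the coverage fractions, and the `weighted_mean` helper are hypothetical, and xdatasets' actual regridding code is not shown on this page.

```python
import numpy as np

# Hypothetical 3x3 block of grid-cell temperatures (K), with one missing value.
t2m = np.array([[270.0, 271.0, 272.0],
                [271.0, 272.0, 273.0],
                [272.0, np.nan, 274.0]])

# Hypothetical fraction of each cell's area covered by the polygon.
weights = np.array([[0.0, 0.2, 0.0],
                    [0.5, 1.0, 0.3],
                    [0.0, 0.4, 0.0]])


def weighted_mean(values, weights):
    """Average the valid cells, weighting each by its coverage fraction."""
    valid = ~np.isnan(values) & (weights > 0)
    return float(np.sum(values[valid] * weights[valid]) / np.sum(weights[valid]))


station_value = weighted_mean(t2m, weights)
print(round(station_value, 3))
```

Cells that only partly overlap the polygon contribute in proportion to their overlap, and missing cells are left out of both the numerator and the denominator.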
[10]:
xds.data.to_dataframe().hvplot(x='time',
                               y=['t2m_nanmax', 't2m_nanmin'],
                               grid=True,
                               width=750,
                               height=450,
                               groupby='Station',
                               legend='top',
                               widget_location='bottom')
[11]:
xds.data.to_dataframe().hvplot(x='time',
                               y=['tp_nansum'],
                               grid=True,
                               width=750,
                               height=450,
                               groupby='Station',
                               legend='top',
                               widget_location='bottom')
[12]:
xds.data
[12]:
<xarray.Dataset>
Dimensions: (Station: 3, time: 15342)
Coordinates:
longitude (Station) float64 -70.94 -74.14 -74.27
latitude (Station) float64 46.72 45.18 46.08
geom (Station) int64 0 1 2
* Station (Station) object '023003' '031101' '040111'
Superficie (Station) object '208.4591919813271' ... '433.440893903503'
* time (time) datetime64[ns] 1978-12-31 1979-01-01 ... 2020-12-31
Data variables:
t2m_nanmax (time, Station) float32 271.1 276.4 271.5 ... 273.7 276.1 272.7
t2m_nanmin (time, Station) float32 268.8 274.4 269.7 ... 269.0 271.3 266.3
tp_nansum (time, Station) float32 0.0 0.0 0.0 ... 0.002386 0.002362
Attributes:
institution: ECMWF
source: Reanalysis
title: ERA5 forecasts
timezone: America/Montreal
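The values in this output are in ERA5's native units: kelvin for t2m and metres for tp. A common follow-up step, sketched below with a few numbers copied from the output above, is converting them to degrees Celsius and millimetres. The variable names are hypothetical and the conversion is not something xdatasets is shown doing here.

```python
import numpy as np

# A few values copied from the output above (native ERA5 units).
t2m_nanmax_k = np.array([271.1, 276.4, 271.5])  # kelvin
tp_nansum_m = np.array([0.002386, 0.002362])    # metres of water

# Kelvin to degrees Celsius, metres to millimetres.
t2m_nanmax_c = t2m_nanmax_k - 273.15
tp_nansum_mm = tp_nansum_m * 1000.0

print(t2m_nanmax_c.round(2), tp_nansum_mm.round(3))
```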